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Direct solvers currently dominate commercial finite element structural software, 
but do not scale well in the fine granularity regime targeted by emerging parallel pro- 
cessors. Substructure based iterative solvers — often called also domain decomposition 
algorithms — lend themselves better to parallel processing, but must overcome several 
obstacles before earning their place in general purpose structural analysis programs. 
One such obstacle is the solution of systems with many or repeated right hand sides. 
Such systems arise, for example, in multiple load static analyses and in implicit linear 
dynamics computations. Direct solvers are well-suited for these problems because after 
the system matrix has been factored, the multiple or repeated solutions can be obtained 
through relatively inexpensive forward and backward substitutions. On the other hand, 
iterative solvers in general are ill-suited for these problems because they often must 
restart from scratch for every different right hand side. In this paper, we present a 
methodology for extending the range of applications of domain decomposition methods 
to problems with multiple or repeated right hand sides. Basically, we formulate the 
overall problem as a series of minimization problems over /^-orthogonal and supple- 
mentary subspaces, and tailor the preconditioned conjugate gradient algorithm to solve 
them efficiently. The resulting solution method is scalable, whereas direct factorization 
schemes and forward and backward substitution algorithms are not. We illustrate the 
proposed methodology with the solution of static and dynamic structural problems, 
and highlight its potential to outperform forward and backward substitutions on paral- 
lel computers. As an example, we show that for a linear structural dynamics problem 
with 11640 degrees of freedom, every time-step beyond time-step 15 is solved in a single 
iteration and consumes 1.0 second on a 32 processor iPSC-860 system; for the same 
problem and the same parallel processor, a pair of forward/backward substitutions at 
each step consumes 15.0 seconds. 
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1. Introduction 


Direct solvers currently dominate commercial finite element structural soft- 
ware, essentially because: (a) they are robust and reliable, (b) they are versatile, 
(c) they work well with secondary storage, and (d) they usually outperform the 
class of iterative algorithms that were popular when these codes were originally 
developed. However, with the advent of parallel processing, alternatives to direct 
solvers must be researched because factorization algorithms applied to systems 
arising from the finite element formulation of structural problems are not scalable 
— that is, their performance does not necessarily increase with the number of 
processors. This lack of scalability is illustrated by the following analysis. 

Most parallel skyline solvers that have been recently reported in the literature 
are closely related to the parallel active column equation solver presented in [1], 
In general, clusters of col umns are distributed across the processors in a block 
wrap fashion (Fig.l). At each step k of the factorization, column k is broadcast 
to all processors and the entries of row k are updated in parallel. 
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Fig. 1. Parallel skyline solver 
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Usually, after the mesh nodes are renumbered for optimal storage, the skyline 
structure becomes close to a banded one so that for all computational complexity 
purposes, one can reasonably assume that the system matrix is a banded one 
(Fig. 2). 


b 



Fig. 2. Banded computational model 


Let 6, S a , Si and N p denote respectively the system semi-bandwidth, the 
64-bit floating-point arithmetic peak performance of a single processor of the 
parallel machine, the peak interconnect speed of that machine measured in bytes 
per second, and its number of processors. We assume that all real data are stored 
in 8-byte words. At each step of the factorization process, the computational and 
communication parallel time of the factorization algorithm can be evaluated as 
follows: 


[factor — 


T t 


Np X Sa 
8x6 


( 1 ) 


ranamit 


Obviously, the best parallel performance is obtained when T, ransmit << Tfactor' 
Therefore, the saturation or balance condition is given by: 


T transmit 


= T f 


actor 


( 2 ) 
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Given a structural problem and a parallel processor, the above saturation condi- 
tion holds if and only if: 

b = 8 x N p x ^ (3) 

Table 1 reports the values of the system semi-bandwidth that meet the saturation 
condition for today’s emerging parallel processors. Clearly, b = 21942 and b = 
60262 are unrealistic values of the matrix semi-bandwidth. For example, most 
finite element models related to aerospace structures have a semi- bandwidth that 
varies between 300 and 2000. Moreover, problems with a semi-bandwidth as large 
as 21942 or 60262 entail memory and CPU requirements that overwhelm even the 
largest of the currently available supercomputing resources. 


Table 1 


Saturation condition - semi-bandwidth values 

Parallel processor 

N p 

Sa 

Si 

6 

iPSC-860 

128 

60 Mflops 

2.8 Mbytes/sec 

21942 

KSR-1 

256 

20 Mflops 

8.5 Mbytes/sec 

4818 

CM-5 

512 

128 Mflops 

8.7 Mbytes/sec 

60262 


If the saturation condition (2) is enforced for a fixed problem characterized by 
its semi-bandwidth 6, the number of “useful” processors — that is, the number 
of processors beyond which any additional processor can only slow down the 
computations — can be computed from Eqs. (1-2) as follows: 


n; 


b Si 
- x — 
8 S a 


( 4 ) 


Table 2 reports for current massively parallel systems the number of useful proces- 
sors processors for b = 1500. This value of the semi-bandwidth is representative 
of today’s large-scale problems in aerospace structures. 


. Table 2 

Number of useful processors - 6 = 1500 


Parallel processor 

N p 

S a 

5; 

N u 
1 V 

n;/n p 

iPSC-860 

128 

60 Mflops 

2.8 Mbytes/sec 

9 

7.0 % 

KSR-1 

256 

20 Mflops 

8.5 Mbytes /sec 

80 

31.2 % 

CM-5 

512 

128 Mflops 

8.7 Mbytes/sec 

13 

2.5 % 
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Clearly, the relative number of processors that are kept busy by direct solvers on 
today’s massively parallel processors is small. 

In s umm ary, the above analysis shows that parallelism in direct solvers is 
limited by the band width and not by the size of the problem, and that direct 
solvers are not scalable on current parallel hardware. 

REMARK LI. Sub-block decomposition strategies for parallel skyline factor- 
ization could provide slightly higher performance on parallel machines than dis- 
tributed col umn methods [2]. However, the conclusions drawn above hold also 
for these mapping schemes. 

REMARK 1.2. For some large-scale problems, sparse direct solvers can be faster 
than skyline ones. However, it has often been reported that sparse direct solvers 
are even less amenable to parallel processing than skyline ones, even on shared 
memory parallel processors (see, for example, [3]). 

On the other hand, substructure based iterative algorithms — known also as 
domain decomposition methods [4] — are known to have better parallel scalability 
properties. Some of these algorithms also compare favorably with direct solvers 
even on ill-conditioned structural problems discretized with shell and beam el- 
ements [5, 6]. However, these semi-iterative algorithms must overcome several 
obstacles before they can be adopted by structural analysis program developers. 
One such obstacle is the solution of systems with many or repeated right hand 
sides. Such systems arise, for example, in multiple load static analyses, in implicit 
linear dynamics, in eigenvalue problems, and in many other structural computa- 
tions. Direct solvers are well-suited for these problems because after the system 
matrix has been factored, the multiple or repeated solutions can be obtained 
through relatively inexpensive forward and backward substitutions. On the other 
hand, iterative solvers are in general ill-suited for these problems because they 
often must restart from scratch for every different right hand side. 

In this paper, we present a methodology for extending substructure based 
iterative algorithms to problems with many or repeated right hand sides. We for- 
mulate the overall problem as a series of consecutive minimization problems over 
if -orthogonal and supplementary subspaces, and tailor the preconditioned con- 
jugate gradient (PCG) algorithm to solve them efficiently. Basically for each new 
right hand side, we first compute an optimal startup solution via the projection 
of the new interface problem onto an agglomerated Krylov space associated with 
previous right hand sides. Next, we improve this solution with an accelerated 
PCG algorithm where all search directions are orthogonalized with respect to the 
Krylov subspaces generated by previous right hand sides. The resulting solution 
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algorithm is scalable, whereas forward and backward substitution algorithms are 
not. Its convergence rate improves with the number of search directions that are 
stored, and therefore its performance depends on the amount of available mem- 
ory. However for most structural problems, the new solution algorithm entails 
only a fraction of the storage requirements of direct solvers. We illustrate the pro- 
posed methodology with the solution of static and dynamic structural problems, 
and highlight its potential to outperform forward and backward substitutions on 
massively parallel computers. As an example, we show that for a linear struc- 
tural dynamics problem with 11640 degrees of freedom, every time-step beyond 
time-step 15 is solved in a single iteration and consumes 1.0 second on a 32 pro- 
cessor iPSC-860 system; for the same problem and the same parallel processor, 
a pair of forward/backward substitutions consumes at each step 15.0 seconds. 
We hope that the proposed methodology will enhance the versatility of domain 
decomposition based iterative algorithms. 


2. Problem formulation and nomenclature 


For the sake of clarity, we first discuss the problem and the proposed solution 
methodology in the absence of any substructuring technique. In Section 4, we 
highlight the role of substructuring and present the substructure based solution 
algorithm. 

Here, we are interested in solving iteratively the following problems: 

Kui = Si i = 1, N rh3 (5) 


where K, {fi}\ Z i' h ‘, and {u,}| ~ ^ rk ‘, denote respectively the stiffness matrix 
of a given structure, a set of N r ) ls generalized force vectors, and the corresponding 
set of N r hs generalized displacement vectors. Such problems arise, for example, 
when multiple load patterns axe applied to a structure, or when a computational 
algorithm requires repeated solutions of a system of linear equations with the 
same matrix but different right hand sides. 

Problems (5) above can be transformed into the following minimization prob- 
lems: 

min 4>,-(u) = — u T Ku — fj u i = 1, ..., N r h s (6) 

u£K n K 2 

where Nk is the dimension of the stiffness matrix, V, is the set of real numbers, 
and T is the transpose superscript. If each minimization problem in (6) is solved 
with a PCG algorithm, the following Krylov subspaces are generated: 


Si = {- 


(1) J2) 


„(*) 


•S ri) > 


* = 1 ) 


N; 


rhs 


(7) 
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where s\ k ^ and rj < Nk denote respectively the search direction vector at iteration 
Jt, and the number of iterations for convergence of the PCG algorithm applied to 
the minimization of $,(u). Additionally, we introduce the following agglomerated 
subspaces: 

Si = 'Q 5, i = 1, ..., N rh , (8) 

j = i 

Let Si denote the rectangular matrix associated with <S,. From the orthogo- 
nality properties of the conjugate gradient method, it follows that: 


SfKSi = Di i = 1, N rh , 

where 

d ix 0 0 O' 

0 d i3 0 0 

Di = 

0 0 •• 0 

0 0 0 di 

u r » J 


( 9 ) 


J 1 

However, note that in general S t KSi is not a diagonal matrix. Finally, we 
define Si as the matrix whose column vectors also span the subspace but are 
orthogonalized with respect to the stiffness matrix K. Hence, we have: 


range (5;) = Si i = 1, ..., N r h 3 

= 15, i = 1, Nrh, 

where 

'd tl 0 0 O' 

= 0 d tJ 0 0 

D, = 

0 0 - • 0 

0 0 0 d tf 


i=< 





( 10 ) 
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3. Projection and orthogonalization 


Suppose that the first problem Kui = /i has been solved in r x PCG itera- 
tions, and that the N K x n matrix Si associated with the Krylov subspace Si is 
readily available. 

Solving the second problem Ku 2 = fi is equivalent to solving: 

min $2( u ) = 7 : u T Ku — f 2 u (11) 

u^TL n k l 

If 71 Nk is decomposed as follows: 

k Nk = Si ©s* 

dim (Si) = Nk-tx (12) 

Si and S* are K — orthogonal 

then the solution of problem (11) can be written as: 

u 2 = + V 2 

where (13) 

u° 6 Si, v 2 £ S* and u° -^ u 2 = v 2 Ku 2 = 0 

From Eqs. (11-13), it follows that u 2 is the solution of the minimization problem: 

min $ 2 ( u ) = 7 : u T Ku — f 2 u (14) 

u €5, 2 

and v 2 is the solution of the minimization problem: 

min 3>2 (u) — 7 : v t Kv — f 2 v (15) 

2 

First, we consider the solution of problem (14). Since u 2 £ S\, there exists 
a y 2 £ 7Z ri such that: 

= S iy2 ° (16) 

Substituting Eq. (16) into Eq. (14) leads to the following minimization problem: 

min $ 2 (y) = - y T S^KSiy — f^Sjy (17) 

2 
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whose solution is given by: 


SfKjS, y“ = h 
where 

f 2 = Sf f 2 


(18) 


From Eq. (9), it follows that the system of equations (18) is diagonal. Hence, the 
components [y®]; of y° can be simply computed as follows: 



(19) 


Next, we turn to the solution of problem (15) via a PCG algorithm. Since 
the decomposition (13) requires v 2 to be K -orthogonal to u^, at each iteration k, 
the search directions s j must be explicitly ff-orthogonalized to S\. This entails 
the computation of modified search directions Sj ^ as follows: 

4 “ = 4 *> + £ -A* 

where (20) 

s\ 9)T Ks[ k) _ s[ k)T Ks[ q) 

a? " s\ 9)T Ks\* " s^Ks? 

Except for the above modifications, the original PCG algorithm is unchanged. 

In summary, once the first problem K\i\ — f\ has been solved in rj PCG 
iterations and the Nk X ri matrix S\ associated with the Krylov subspace S\ has 
been stored, the second problem Ku 2 = f 2 is solved in two steps as follows: 

Step 1. K is projected onto S\ and the resulting diagonal problem S^KS iy° = 
Sfh is trivially solved in Nk floating-point operations. Next, the par- 

tial solution = S\y 2 is formed. This partial solution u° is an optimal 
startup value for u 2 because: (a) it minimizes u T Ku / 2 — u T f 2 over 
Si c ‘R- Nk , and (b) it is inexpensive to compute. Note that the rq non- 
zero entries of the diagonal matrix S^KS\ are automatically computed 
during the PCG solution of the first problem Ku\ — f\ . Therefore, 
these entries can be stored and need not be re-computed. 

Step 2. The basic PCG algorithm is applied to the solution of Ku 2 = f 2 after it 
is modified to: (a) accept u° as a startup solution, and (b) orthogonalize 

the search directions and S\ with respect to K . 
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The generalization to the case of N r k s right hand sides of the two-step so- 
lution procedure described above goes as follows. Suppose that i* 1 < N r h 3 
consecutive problems Kit,- = /,, i = 1, t* - 1 have been solved with aJ*CG 
algorithm modified as described above, and that the Nk x.G*-i matrix Si-~\ 
associated with the K-orthogonalized agglomerated Krylov subspace <S,*_i has 
been stored. The i*-th problem Kui- = /;• is solved in two steps. First, K is 
projected onto and an optimal startup solution u®. = is computed 

via the trivial solution of the diagonal system . Next, 

the PCG algorithm is applied to the solution of Kv,i- = /,* with u°. as a startup 

solution, and all search directions are K-orthogonalized to S;*_ i. 

Clearly, the performance of the solution method proposed here depends on 
the performance of the preconditioner used in the conjugate gradient algorithm. 
However, it should be noted that from the decomposition (12), it follows that if 
after the PCG solution of some problem indexed by 1 < i* < N r h 3 the dimension 
r,. of the K-orthogonalized Krylov subspace *?,-• becomes equal to the size of one 
problem Nk, the proposed solution method will converge in theory to a direct 
solver. In that case, each remaining problem Ku, = /<, i = i* + 1, N r h 3 will be 
solved directly (zero iteration) and economically as follows (see Eq. (10)): 

«. = Si- y° i = i* + 1, N r h 3 

Where = (21) 

Wly = 3 = 1. Nk 

d 'i 

In practice, the superconvergence behavior outlined above will be reached before 
the point where r,» = Nk, and each of the PCG solutions of the remaining 
N r hs ~ i* problems will converge in two or three iterations. 

REMARK S.l. From Eqs. (11-19), it follows that if two right hand sides /, and 
f i+ j are proportional, the solution of the problem with right hand side /,+ 1 is 
Uj +1 = u° +1 = Siy^+i, ^d therefore, this solution will be found in zero iteration. 


4. Primal and dual substructuring methods 

Despite its elegance and simplicity, the methodology described in Sec- 
tion 3 can be impractical when applied to the global solution of the problems 
Kui = fi, i = 1, N r h s - Indeed, during the PCG solution of the first few problems 
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— that is, before superconvergence can be reached — , the cost of the orthogonal- 
izations implied by Eq. (20) can offset the benefits of convergence acceleration via 
the optimal startup solution and the modified search directions . Moreover, 
storing every search direction and the corresponding matrix- vector product 
Ks\ k ^ can significantly increase the memory requirements of the basic PCG al- 
gorithm. 

However, for symmetric elliptic problems, most powerful preconditioners are 
based on domain decomposition [4, 5], and for such preconditioners, the CPU and 
memory drawbacks outlined above become less important as discussed below. 

Domain decomposition methods are essentially substructuring methods where 
different solution algorithms can be applied to the local and interface problems. 
In these methods, the substructures are seldom physical ones. In general, they are 
obtained by partitioning the global mesh into a number of subdomains following 
some specified criterion [7]. In structural mechanics, the local problems corre- 
spond to the static condensation of the substructure internal degrees of freedom, 
usually via a direct method, and the interface problem corresponds to the eval- 
uation of the substructure interface boundary degrees of freedom, usually via a 
PCG algorithm. Depending on the choice of the substructure interface unknowns, 
two substructuring methods can be formulated: 

l. The primal substructuring method. This is the classical substructuring 
method where the structural degrees of freedom are partitioned into internal ones, 
designated here by the subscript /, and interface boundary ones, designated here 
by the subscript B. The substructure equations of equilibrium are given by: 

Kjj u* = ff~~ u b s = 1> N 3 

s=N, s=N. »=N. (22) 

T: K\ b «* + ^ K*bb u b — X/ fo 

J=1 3 — 1 «=1 

where N 3 denotes the number of substructures. After the internal degrees of 
freedom axe eliminated via static condensation, Eqs. (22) above are transformed 
into the following interface problem: 

IE - K‘ BB - k; t b k;-,' k; b \u B = £ r B - K&Kf,'/; ( 23 ) 

*=1 a=l 

which can be efficiently solved with a substructure based PCG algorithm [5]. 
Because the interface problem (23) is written in terms of a “primal” displacement 
variable ub , we refer to this substructuring method as a primal method. 
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2. The dual substructuring method. This substructuring method is based on 
a variational principle [6,8]. The interface unknowns are chosen as Lagrange 
multipliers representing surface tractions at the substructure interface boundaries 
A. The substructure equations of equilibrium are given by: 

/' - J3’ T A s = 1, N, 

(24) 

0 


K'u’ = 

»=7V. 

E B '“’ = 

3 = 1 


where B* is the finite element matrix associated with the spatial discretization 
of the Lagrange multipliers A. The substructure displacement fields u* can be 
eliminated from Eqs. (25) to obtain an interface problem written in terms of the 
“dual” variables A: 


s~N t s=N t 

[ Y B'K' + B sT ] A = Y B‘K‘ + f s (25) 

3=1 3=1 

where K* + is a generalized inverse of K* that becomes identical to K* when 
substructure s is non-floating [6]. The dual interface problem (25) can be also 
solved efficiently with a substructure based PCG algorithm [6,8,9]. 

The size of the primal interface problem (23) is directly related to the number 
of interface nodes, while the size of the dual interface problem (25) is directly 
related to the discretization type and order of the Lagrange multipliers A. In 
general, the size of the dual problem (25) is less or equal to the size of the primal 
problem (23) [10]. Therefore, whether a primal or dual substructuring method 
is chosen, the size of the interface problem is less or equal to the number of 
interface nodes multiplied by the maximum number of degrees of freedom per 
node. Moreover, computational efficiency in the iterative solution — especially 
on parallel processors — dictates choosing the number of substructures such that 
the n um ber of interface nodes does not exceed 10 to 20 % of the total number of 
nodes in the finite element mesh. 

Therefore, the proposed methodology for solving the multiple problems 
Kui = fi, i = 1, ..., N r hs is computationally feasible when the PCG algorithm 
is used in a substructuring context because: 

• the additional memory requirements entailed by the storage of the search 
directions and the matrix vector products Ks\ k ^ are proportional only 
to the reduced size of the interface problem. 
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• the cost of the orthogonalizations implied by Eq. (20) are negligible 
compared to the cost of the forward and backward substitutions that 
are required at each iteration k for the evaluation of the matrix-vector 
product Kj]'[Kj B Ug*] (primal method), or the matrix- vector product 
K* + [B* T A«] (dual method). 

We have previously shown that during the solution via an iterative dual 
substructuring method of a single problem Ku = /, the spectral pattern of the 
governing matrix (25) causes a rapid loss of orthogonality between the computed 
search directions [11,12]. This result also holds for the solution of a single problem 
Ku = / via an iterative primal substructuring method with a Neumann precondi- 
tioner [5]. The loss of orthogonality between the search directions has a disastrous 
consequence on the convergence rate of the PCG algorithm and can be compen- 
sated only by an explicit re-orthogonalization procedure that is identical to that 
of Eq. (20). In reference [12], we have shown for several large-scale structural 
problems that this re-orthogonalization procedure improves significantly the con- 
vergence rate of the PCG algorithm and reduces drastically its total CPU time, 
without dramatically increasing its storage requirements. 

Hence, the orthogonalization procedure (20) has two positive and synergistic 
effects on the solution via a substructure based iterative methodology of the mul- 
tiple problems Kui = /,-, t = 1, ..., N r h a : (a) when invoked to /iT-re-orthogonalize 
the columns of Si , it accelerates the convergence of the solution of the first prob- 
lem, and thus reduces the number of search directions that must be stored for 
the solution of the next problem, and (b) when applied to K orthogonalize s ) 

and S»-i, it accelerates the convergence of the i-th problem and prepares the 
evaluation of the optimal startup solution of the i-th + 1 problem. 

REMARK 4.1. In practice, the number of search directions that are stored 
for orthogonalization is determined by the memory space that is left available 
after all other storage requirements of the structural analysis have been satisfied. 
When only a few directions can be stored, a partial orthogonalization procedure 
is performed. 

Finally, it should be noted that the solution methodology proposed herein 
involves essentially dot products and matrix-vector multiplications; therefore, it 
scales well in the fine granularity regime targeted by emerging parallel processors, 
whereas direct forward and backward substitutions do not [1]. 
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5. Applications 

Here, we apply the methodology described in the previous sections to the 
solution of repeated systems arising from the static analysis of a stiffened wing 
panel under multiple load conditions, and the linear transient analysis using an 
implicit time-integration scheme of a line-pinched membrane with a circular hole. 
The first problem illustrates the effect on convergence of the number of stored 
search directions, and quantifies the corresponding memory requirements. The 
second problem highlights the superconvergence effects of the methodology in 
the presence of a large number of repeated systems, and demonstrates its parallel 
scalability. The substructure based PCG method selected in both applications is 
the FETI method [6,12] with the “lumped” preconditioner [12,14], In all cases, 
the convergence criterion is set to: 


H *'" 1 ~ Ml < nr 3 (26) 

ll/ilb _ 

We report performance results on the CRAY Y-MP and the iPSC-860 supercom- 
puters that demonstrate the fast convergence and computational efficiency of the 
proposed methodology, and highlight its parallel scalability. 

5.1. Multiple load static analysis 

First, we consider the static analysis of a stiffened wing panel from the V22 
tiltrotor aircraft [13]. The corresponding finite element model (Fig. 3) contains 
9486 nodes, 18272 triangular shell elements with 6 d.o.f. per node, and a total 
number of 56916 degrees of freedom. The panel is clamped at one end and two 
different load cases are applied at the other end: a uniformly distributed bending 
load perpendicular to the main plane of the panel (LI), and a similar load as 
in (LI) but with a non-uniform spatial distribution. The finite element mesh is 
decomposed into 16 subdomains using the Greedy algorithm [7], The resulting 
mesh partition contains 747 interface nodes. All computations are performed on 
a single processor CRAY Y-MP. 
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Fig. 3. Stiffened wing panel from the V22 tiltrotor aircraft 


The structural problem corresponding to load case (LI) is solved using the 
basic FETI iterative method. The size of the dual interface problem is 4482 
Convergence is achieved after 359 iterations and 120 seconds CPU. For this prob 
lem, the FETI code runs at 135 Mflops and requires 15.0 million 64 bit words 
(MW). An optimized direct solver running at 235 Mflops requires 51 MW and 
computes the solution in 199 seconds. The performance of the proposed method 
ology for load case (L2) is reported in Table 3 as a function of the number of 
stored search directions. This number is expressed as a percentage of the 359 
iterations computed during the iterative solution of load case (LI). 


Table 3 

Stiffened wing panel from the V22 tiltrotor aircraft 
Performance of the solution method for load case (L2) 


# of stored (LI) directions (% of 359) 

Total memory requirements (MW) 

# of iterations 

CPU time for orthogonalizations (secs) 
Total CPU time (secs) 


0 

20 

40 

60 

80 

100 

15.0 

14.9 

15.6 

15.3 

15.3 

15.3 

370 

300 

230 

150 

90 

60 

5.2 

5.0 

4.5 

3.3 

2.3 

1.8 

124.0 

100.0 

77.0 

51.0 

30.5 

20.5 


The results reported in Table 3 are in agreement with the theory presented in 
Section 3. The greater is the number of stored (LI) search directions, the better 
is the startup solution (17-19), the smallest is the number of iterations needed 
for solving the minimization problem (15), and the fastest is the convergence of 
the overall methodology. When all of the search directions generated during the 
solution of problem (LI) are stored, the solution time for problem (L2) is reduced 
to less than 17% of the solution time for problem (LI). The reader can also ob- 
serve that for this two load case problem, the total memory requirements of the 
solution procedure is almost independent of the number of stored (LI) search di- 
rections. Indeed, increasing the number of stored (LI) search directions increases 
the storage requirements corresponding to the K orthogonalization of ^ and S i . 
However, it also minimizes the number of iterations for the solution of problem 
(L2), and therefore reduces the storage requirements associated with the K re- 
orthogonalization of S 2 (we recall the reader that the latter re-orthogonalization 
is an intrinsic component of the FETI method that accelerates the convergence 
of the PCG algorithm for a single dual interface problem (25)). In all cases, the 
memory requirements of the proposed iterative solution procedure do not exceed 
31% of the memory requirements of a direct skyline solver. 

While load cases (LI) and (L2) are not proportional, they solicit the same 
structural degrees of freedom. For this reason, we also report results for an 
analysis involving load cases (LI) and L(3), where load case (L3) corresponds to 
a non-uniformly distributed shearing load at the non-clamped end of the panel. 
Clearly, load case (L3) excites many modes of the structure that are not excited 
by load case (LI), which explains the slight degradation in performance reported 
in Table 4. However, the overall conclusions drawn for the combination (L1-L2) 
are shown to hold for the combination (L1-L3). 
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Table 4 

Stiffened wing panel from the V22 tiltrotor aircraft 
Performance of the solution method for load case (L3) 


# of stored (LI) directions (% of 359) 

0 

20 

40 

60 

80 

100 

Total memory requirements (MW) 

15.0 

14.9 

15.7 

15.4 

15.3 

15.3 

# of iterations 

380 

310 

240 

165 

115 

85 

CPU time for orthogonalizations (secs) 

5.5 

5.4 

4.9 

3.8 

3.1 

2.6 

Total CPU time (secs) 

126.0 

105.0 

82.0 

57.0 

40.0 

30.5 


5.2. Implicit linear dynamic analysis 

Next, we consider the transient analysis via an implicit time-integration 
scheme of a line-pinched membrane with a circular hole. The membrane is dis- 
cretized in 5680 4-node elements and 11640 degrees of freedom (Fig. 4). The finite 
element mesh is partitioned into 32 subdomains. The size of the dual interface 
problem is 1892 — that is, 16.25% of the size of the global problem. The transient 
analysis is carried out on a 32 processor iPSC-860 system. After all of the usual 
finite element storage requirements are allocated, there is enough memory left to 
store a total number of 891 search directions. This number corresponds to 47% 
of the size of the dual interface problem. 
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Fig. 4. Line-pinched membrane with a circular hole 

The system of equations arising at the first time step is solved in 322 it- 
erations using the FETI transient method [12], After 3 time steps, 435 search 
directions are accumulated (Fig. 5) and only 20 iterations are needed for solving 
the fourth linear system of equations (Fig. 6). After 16 time steps, the total num- 
ber of accumulated search directions is only 536 — that is, only 28% of the size 
of the dual interface problem, and superconvergence is triggered: all subsequent 
time steps are solved in 2 or 3 iterations (Fig. 7) and in less than 1.0 second CPU 

(Fig. 8). 

When a direct solver is applied to the above problem, at each time step, the 
pair of forward/ back ward substitutions consumes 15.0 seconds on the same 32 
processor iPSC-860. Therefore, the proposed solution methodology is clearly an 
excellent alternative to repeated forward /backward substitutions on distributed 
memory parallel processors. 
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Fig. 5. Accumulation of search directions 
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IMPLICIT LINEAR DYNAMICS 
Pinched Membrane with a Hole 



Fig. 6. Convergence rate history 
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Fig. 8. CPU history 
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6. Conclusion 


In this paper, we have presented a methodology for extending the range of 
applications of domain decomposition based iterative methods to problems with 
multiple or repeated right hand sides. Such problems arise, for example, in multi- 
ple load static analyses, in implicit linear dynamics, in eigenvalue problems, and 
in many other structural computations. We have formulated the global prob- 
lem as a series of minimization problems over if -orthogonal and supplementary 
subspaces, and have tailored the preconditioned conjugate gradient algorithm to 
solve them efficiently. The resulting solution method is scalable in the fine gran- 
ularity regime targeted by emerging parallel processors, whereas direct factor- 
ization schemes and forward and backward substitution algorithms are not. We 
have illustrated the proposed methodology with the solution of realistic static 
and dynamic structural problems, and have highlighted its potential to outper- 
form forward and backward substitutions on parallel computers. The proposed 
methodology enhances the versatility of domain decomposition based iterative 
algorithms. 
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